The latest version of this document can be found on the Marioni Lab GitHub page
library(ggplot2)
library(reshape2)
library(parallel)
library(Matrix)
library(pheatmap)
library(RColorBrewer)
library(reshape2)
library(scales)
library(ggsignif)
library(viridis)
# library(knitr)
# opts_knit$set(eval.after = 'fig.cap')
library(quadprog)
library(BiocStyle)
#change this to run the code easily
folder_location = "/Users/griffi01/Dropbox/BarcodeSwapping2017/"
With the rapid increase in throughput of next-generation sequencing technology, an individual run of a typical sequencing machine (e.g. those produced by Illumina) produces many more reads than is necessary for interpreting libraries generated by most functional genomic assays. Consequently, to make efficient use of these machines, DNA libraries are pooled together prior to sequencing, a process known as “multiplexing”. The ligation of unique barcodes onto DNA molecules within each library before pooling incorporates a known sequence into each sequence read, allowing the separation of reads from different libraries after sequencing is completed. Multiplexing also protects against technical artefacts between samples (e.g. batch effects) and sample loss due to sequencing lane failures. As such, multiplexing is widely considerd to be good practice for many experiments, and is essential for cost effective analysis of small libraries such as those in single-cell RNA-sequencing experiments.
The most recent DNA-sequencing machines released by Illumina (HiSeq 3000/4000/X, X-Ten, and NovaSeq) use patterned flow cells to increase throughput and cost efficiency. On these new flow cells, the process of “seeding” DNA molecules into the patterned wells and amplification of the seeded DNA occur simultaneously. These machines have been in use for several years in a diverse range of genomic fields.
knitr::include_graphics(paste0(folder_location, "/figs/schematic.png"))
Figure 1: Swapping schematic
The method by which swapping is proposed to take place, from Sinha et al., is shown.
However, it has been recently reported that these machines can lead to the mislabelling of DNA molecules with the incorrect library barcode. A schematic showing the swapping method proposed by Sinha et al. (2017) is shown in Figure 1 The mislabelling is likely driven by the extension of free barcode molecules using other DNA molecules as a template. The phenomenon has been acknowledged by Illumina (Illumina 2017), although estimates of swapping rates vary between reports. It is unclear whether a permanent solution to the problem will be forthcoming, as rapid amplification after seeding is critical to the operation of the patterned flow cell machines (Sinha et al. 2017).
This “swapping” (also called “hopping” or “switching”) of barcode labels can easily confound analyses: reads labelled with a barcode specific to a given sample may have originated from any other sample multiplexed with it, rather than coming from the sample itself. This phenomenon is particularly relevant for rapidly developing single-cell techniques, where a large number of samples are necessarily multiplexed together.
This file steps through the analysis upon which the manuscript Detection and Removal of Barcode Swapping in Single-Cell RNA-seq data is based, with all code used present and accessible by clicking on the Code buttons.
Before continuing, we define a few terms:
The swapping rate is the total proportion of transcripts that are moving between all libraries in a sequencing lane due to barcode swapping
Swapping occurs between a “donor” library, from which the transcript originated, and “recipient” libraries, in which the swapped read is detected following sequencing.
All data presented here was sequenced at the same core facility, at CRUK CI, Cambridge, UK.
libs = read.table(paste0(folder_location, "/data/richard_4000.tab"), header = T, stringsAsFactors = F)
libs_2500 = read.table(paste0(folder_location, "/data/richard_2500.tab"), header = TRUE, stringsAsFactors = FALSE)
# There's an empty well on one of the plates too - we should mark this as "unexpected"
libs$expected[libs$empty_well] = FALSE
libs_2500$expected[libs_2500$empty_well] = FALSE
In the first dataset analysed, we have two plates of single-cell RNA-seq libraries (mouse blood cells). We used dual indices for cell labelling (a different barcode was used at each end of the molecule). The barcodes used for each plate are from mutually exclusive sets - any barcode from one plate was never used on the other (Figure 2). For sequencing, all cell libraries from the two plates were multiplexed.
ggplot(data = libs, mapping = aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = expected), col = "grey50") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Barcode 1", y="Barcode 2") +
scale_fill_manual(values = c("FALSE" = "grey30", "TRUE" = "coral"), name = "", labels = c("No cell", "Cell present"))
Figure 2: Richard data experimental design
Cells were seeded in each of the orange positions in the plot. Neither set of barcodes used for a given 96-well plate were used in the other. One barcode combintion (N729,S522) was left without a cell
Cells were prepared for single-cell RNA-seq by the SmartSeq2 protocol(Picelli et al. 2014) with the following modifications. Cells were sorted into 96-well plates holding 4 µL of lysis buffer composed of 2.3 U SUPERase In RNase inhibitor (Thermo Fisher Scientific), 0.11 % (v/v) Triton X-100 (Sigma), 12.5 mM DTT (Thermo Fisher Scientific), and 2.5 mM dNTP mix (Thermo Fisher Scientific). 1 µL annealing mix containing diluted ERCC RNA Spike-In Mix (Thermo Fisher Scientific) and 10 µM oligo-dT30VN (Sigma) was added to each well before to reverse transcription with SuperScript II (Invitrogen). cDNA amplification was performed with 23 PCR cycles and the resulting PCR products purified with Ampure XP Beads (Agencourt) at a volume ratio of 0.7:1 beads:DNA. Libraries were prepared using Nextera XT DNA Sample Preparation Kit and indexes from the Nextera XT Index Kit v2 Set A and Nextera XT Index Kit v2 Set D (Illumina). Each plate of libraries was pooled, cleaned with Ampure XP beads, and quantitated using KAPA Library Quantification Kit (Roche). Equimolar quantities of libraries from each plate were combined for sequencing.
Read mapping was performed using Subread (v1.5.1; considering uniquely mapped reads and with Phred offset of 33, with otherwise default parameters) using the Ensembl mm10 genome with ERCC92 annotation.
Reads were counted using the featureCounts function in the Rsubread packageRsubread (default options, except minMQS=10) using the Ensembl mm10 GTF file, and Thermo-Fisher ERCC92 GTF (see the Marioni Lab Github).
Reads were demultiplexed allowing for any of the barcode combinations shown in Figure 2 (including the impossible combinations).
In the absence of barcode swapping, it should be impossible to observe mapped reads with barcodes from each of the two different plates, i.e., there should be no mapped reads in the grey areas of Figure 2. We will refer to these barcode combinations as “impossible” combinations, in comparison to the expected combinations in orange. Note that we consider only mapped reads as these are less likely to derive from various sequencing artefacts or contamination from e.g. human or bacterial sources (especially in the empty well).
We plot in Figure 3 the observed numbers of mapped reads for each barcode combination. Where cells have been loaded, there are many mapped reads (yellow/green areas). In the impossible barcode combinations we observe a lower but non-zero number of mapped reads.
ggplot(data = libs, mapping = aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = log10(mapped))) +
theme(axis.text.x = element_text(angle = 90)) +
scale_fill_viridis(name = "log10\nmapped\ncounts") +
labs(x="Barcode 1", y="Barcode 2")
Figure 3: Number of mapped read per barcode combination
The number of mapped reads is shown for each barcode combination, as a log10 count.
medfrac = median(libs$reads[!libs$expected])/median(libs$reads[libs$expected])
mappedfrac = median(libs$mapped[!libs$expected])/median(libs$mapped[libs$expected])
totfrac = sum(libs$mapped[!libs$expected])/sum(libs$mapped[libs$expected])
These library sizes are shown in Figure 4 below. The impossible combinations have a median library size that is 1.5% of the median size of the expected combinations. The corresponding value for all reads (i.e. mapped and unmapped) is similar, at 1.8%. Considering all mapped reads on the plate, there are 1.1% in unexpected combinations relative to the expected combinations. Note that we consider the empty well to be an ``impossible’’ barcode combination though it does contain a low level of mapped ERCC spike-in reads.
ggplot(data = libs, mapping = aes(x = expected, y = mapped)) +
geom_boxplot(fill = c("royalblue2","springgreen3")) +
scale_y_log10(breaks = 10^(seq(3,6)), limits = c(1e3, max(libs$mapped)), labels = c("1,000", "10,000", "100,000", "1,000,000")) +
labs(x = "Real cell", y = "Library size") +
theme_bw() +
scale_x_discrete(labels = c("\"Impossible\" barcode\ncombination", "Expected barcode\ncombination"), name = "")
Figure 4: Mapped library sizes of seeded and unseeded wells
The library sizes of the unexpected barcode combinations (i.e. wells without cells) are lower than those of the expected combinations (wells with cells), but considerably larger than 0.
If we make the assumption that barcode swapping is rare, then it is unlikely that one molecule will undergo more than one round of swapping. We also assume that the number of observed swapped transcripts is proportional to the number of transcripts that are available for swapping. If this is true, then the size of each of the impossible combinations will be proportional to the sum of the library sizes of the expected combinations that share a single barcode. These are the reads for which a single barcode swap could move them to the impossible barcode combination.
For example, the combination N719 and S505 (red position) would take swapping contributions from the cells in the blue positions in Figure 5.
is_target = libs$bc1 == "N719" & libs$bc2 == "S505"
avail = (libs$bc1 == "N719" | libs$bc2 == "S505") & libs$expected
schem_cols = rep("black", nrow(libs))
schem_cols[libs$expected] = "grey40"
schem_cols[is_target] = "indianred"
schem_cols[avail] = "cornflowerblue"
col_index = unique(schem_cols); names(col_index) = col_index
ggplot(data.frame(bc1 = libs$bc1, bc2 = libs$bc2, cols = schem_cols), aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = as.character(cols))) +
scale_fill_manual(values = col_index) +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Barcode 1", y = "Barcode 2")
Figure 5: Considered ‘available’ transcript for barcode combination (N719,S505)
We assume that the swapped transcript for the red labelled barcode combination will derive from the blue labelled combinations on the cell-loaded plate. These combinations need to swap only a single barcode to arrive at the red labelled cell. Cell-loaded combinations are shown in grey, and unloaded combinations in black.
We now more formally define the model:
Let \(X_{i,j}\) denote the mapped library size for the library indexed by barcodes \(i\) (\(i \in {1,\dots,16}\), along rows; \(i=1\Rightarrow S522, i=16\Rightarrow S502\)) and \(j\) (\(j \in {1,\dots,24}\), along columns; \(j=1\Rightarrow N701, j=24\Rightarrow N729\)). The experimental design results in cells being seeded in two blocks of 96 wells (see Figure 2), corresponding to \(1 \leq i \leq 8\), \(13 \leq j \leq 24\) and \(9 \leq i \leq 16\), \(1 \leq j \leq 12\). Remaining barcode combinations (in the off-diagonal blocks) were not seeded with any cells. Additionally the well with index \(i=1,j=24\) was also empty by design.
Since index swapping is assumed to occur only for cells sharing one barcode, we assume that the library size of the unoccupied barcode combinations is proportional to the library size of cell-seeded barcode combinations that share a single index.
#melting the data frame purely to make casting it easier
molten = melt( libs[libs$expected,] , id.vars = c( "bc1" , "bc2" ) , measure.vars = "mapped" )
#casted to make easier to rowSums/colSums to get the contributions per barcode
casted = acast( molten , bc1~bc2 )
nsums = rowSums(casted, na.rm = T)
ssums = colSums(casted, na.rm = T)
unexp = libs[!libs$expected, ]
#now make the data frame from unexpected combinations for model fit
cell_outs = lapply(1:nrow(unexp), function(x) data.frame(mapped = unexp$mapped[x],
col = ssums[unexp$bc2[x]],
row = nsums[unexp$bc1[x]]))
df = do.call(rbind, cell_outs)
df$tot = df$col + df$row
ggplot(df, aes(x = tot, y = mapped)) +
geom_point(col = "grey50") +
geom_smooth(method = "lm", se = FALSE, col = "black", formula = y ~ x) +
# scale_x_log10() + scale_y_log10() +
labs(x = expression("Available swapping reads,"~S["i,j"]), y = expression("Observed swapped reads,"~X["i,j"])) +
theme_bw() +
scale_x_continuous(breaks = c(5e6, 1e7, 1.5e7), labels = c("5,000,000", "10,000,000", "15,000,000")) +
scale_y_continuous(breaks = seq(2500, 10000, 2500), labels = c("2,500", "5,000", "7,500", "10,000"))+
theme(axis.text = element_text(face = "bold", size = 11),
axis.title.y = element_text(size = 13, face = "bold"),
axis.title.x = element_text(size = 13, face = "bold"))
Figure 6: Swapping model fit
The one-barcode-distant available transcript and observed swapped transcript counts are shown, with the model fit described above overlaid.
mod_mapped = lm(df$mapped ~ df$tot)
#retain for later use
df_4000 = df
We therefore fit the model: \[X_{i,j} = \alpha + \beta S_{i,j}\] for
\[1 \leq i \leq 8, 1\leq j \leq 12, \text{and } 9\leq i \leq 16, 13\leq j \leq 24\]
where
\[S_{i,j} = \begin{cases} \sum_{k=13}^{24} X_{i,k} + \sum_{l=9}^{16} X_{l,j} \mbox{ for } 1 \leq i \leq 8, 1 \leq j \leq 12 \\ \sum_{k=1}^{12} X_{i,k} + \sum_{l=1}^{8} X_{l,j} \mbox{ for } 9 \leq i \leq 16, 13 \leq j \leq 24 \end{cases}\]
using ordinary least squares. Data and the model fit are shown in Figure 6. The two variables are highly correlated (\(R^2\) = 0.811).
The gradient is 0.000576, which corresponds to the rate at which barcode swapped molecules appear in the unexpected barcode combination libraries from the 20 donor libraries (see Figure 5).
However, barcodes are also swapping from each of the donor libraries into all others that share a barcode - into 38 wells. For example, the red cell shown in Figure 7 below can swap reads into all of the blue cells.
is_target = libs$bc1 == "N705" & libs$bc2 == "S505"
avail = (libs$bc1 == "N705" | libs$bc2 == "S505") #& libs$expected
schem_cols = rep("black", nrow(libs))
schem_cols[libs$expected] = "grey40"
schem_cols[avail] = "cornflowerblue"
schem_cols[is_target] = "indianred"
col_index = unique(schem_cols); names(col_index) = col_index
ggplot(data.frame(bc1 = libs$bc1, bc2 = libs$bc2, cols = schem_cols), aes(x = bc1, y = bc2)) +
geom_tile(aes(fill = as.character(cols))) +
scale_fill_manual(values = col_index) +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Barcode 1", y = "Barcode 2")
Figure 7: Pattern of swapping
Our model was fitted to the empty barcode combinations, which received libraries from the real cells. Each of those donor cells (one is shown here in red) swaps not into one particular library, but into all 38 recipient libraries that share a single barcode (here shown in blue). Cell-loaded combinations are shown in grey, and unloaded combinations in black.
So, to estimate the fraction of all reads that have swapped (the swapping rate), we multiply the gradient by 38, giving an estimated swapping rate of 2.187±0.0765%. Notably, this is higher than the median-to-median fraction of 1.5%. This is because the median-to-median fraction only considered swapped reads arriving in the unexpected barcode combinations, whereas this gradient-estimated value considers reads swapping across the entire set of barcode combinations.
The intercept value for the model is -398.2±168.8, which is close to zero as expected.
The exact same pool of libraries was also sequenced on a HiSeq 2500. We see strong correlation between library sizes from the two machines, as shown in Figure 8.
libs = libs[order(libs$bc1, libs$bc2),]
libs_2500 = libs_2500[order(libs_2500$bc1, libs_2500$bc2),]
if(!all( (libs$bc1 == libs_2500$bc1) & (libs$bc2 == libs_2500$bc2) )){
print("barcodes don't match")
}
ggplot(data = data.frame(hiseq_4000 = libs$mapped, hiseq_2500 = libs_2500$mapped, expected = libs$expected),
mapping = aes(x=hiseq_4000, y = hiseq_2500, col = factor(expected, levels = c(TRUE, FALSE), labels = c("Real cell", "\"Impossible\" barcode\ncombination")))) +
geom_point(alpha = 0.7) +
scale_x_log10( breaks = 10^(4:6), labels = c("10,000", "100,000", "1,000,000"), name = "HiSeq 4000 Library size") +
scale_y_log10( breaks = 10^(3:5), labels = c("1,000", "10,000", "100,000"), name = "HiSeq 2500 library size") +
scale_color_brewer(palette = "Set1", name = "") +
theme_bw()
Figure 8: Library sizes between sequencing machines
We observe excellent correlation between library sizes of the two sequencing machines, particularly for the real cell libraries.
mappedfrac_2500 = median(libs_2500$mapped[!libs_2500$expected])/median(libs_2500$mapped[libs_2500$expected])
totfrac_2500 = sum(libs_2500$mapped[!libs_2500$expected])/sum(libs_2500$mapped[libs_2500$expected])
In the HiSeq 2500 data, many fewer reads are present in the unexpected barcode combinations. The unexpected combinations have a median library size of 0.11% the median size of the expected combinations (vs. 1.5% from HiSeq 4000). Considering all mapped reads on the plate, there are 0.084% in the unexpected combinations as in the expected ones (vs. 1.1% from HiSeq 4000).
We have applied the same model to the HiSeq 2500 data as described above for the HiSeq 4000 data. The fit and data is displayed graphically in Figure 9
molten = melt( libs_2500[libs_2500$expected,] , id.vars = c( "bc1" , "bc2" ) , measure.vars = "mapped" )
casted = acast( molten , bc1~bc2 )
nsums = rowSums(casted, na.rm = T)
ssums = colSums(casted, na.rm = T)
unexp = libs_2500[!libs_2500$expected, ]
cell_outs = lapply(1:nrow(unexp), function(x) data.frame(mapped = unexp$mapped[x],
col = ssums[unexp$bc2[x]],
row = nsums[unexp$bc1[x]]))
df = do.call(rbind, cell_outs)
df$tot = df$col + df$row
df_outlier = df[-which.max(df$mapped),]
mod_mapped_outlier = lm(df_outlier$mapped ~ df_outlier$tot)
mod_mapped_2500 = lm(df$mapped ~ df$tot)
ggplot(df, aes(x = tot, y = mapped)) +
geom_abline(slope = coef(mod_mapped_outlier)[2], intercept = coef(mod_mapped_outlier)[1], lwd = 1.5 ) +
geom_point(x = df$tot[which.max(df$mapped)], y = max(df$mapped), col = "coral", size = 3, alpha = 1) +
geom_point(col = "grey50") +
# geom_smooth(method = "lm", se = FALSE, col = "black", formula = y ~ x) +
# scale_x_log10() + scale_y_log10() +
labs(x = expression("Available swapping reads,"~S["i,j"]), y = expression("Observed swapped reads,"~X["i,j"])) +
theme_bw() +
scale_x_continuous(breaks = c(5e6, 1e7), labels = c("5,000,000", "10,000,000")) +
scale_y_continuous(breaks = c(0, 600, 1200), labels = c("0", "600", "1,200"), limits = c(0,1250))+
theme(axis.text = element_text(face = "bold", size = 11),
axis.title.y = element_text(size = 13, face = "bold"),
axis.title.x = element_text(size = 13, face = "bold")) +
annotate(geom = "text", x = df$tot[which.max(df$mapped)], y = max(df$mapped)-75, label = "Outlier removed from model fit")
Figure 9: Model fit for HiSeq 2500 data
A similar trend as for the HiSeq 4000 data can be observed. We have excluded one outlying data point from the fit to prevent disproportionate influence on the prediction of the gradient.
Interestingly, we do still observe the swapping pattern. However, the swapping rate estimated is around an order of magnitude lower on the HiSeq 2500: 0.224±0.00955% on the HiSeq 2500 vs. 2.187±0.0765% on the HiSeq 4000.
#Load the 4000/2500 blood data
noswap = as.matrix(read.table(paste0(folder_location, "/data/nest_2500.tab"), header = T, row.names = 1))
swap = as.matrix(read.table(paste0(folder_location, "/data/nest_4000.tab"), header = T, row.names = 1))
#some of the cells had to be re-pooled, so not sequencing the exact same libraries. Remove these.
newpool_lanes = which(grepl("9527", colnames(swap)) | grepl("9548", colnames(swap)) | grepl("9549", colnames(swap)) | grepl("9550", colnames(swap)))
noswap_all = noswap[, -newpool_lanes]
swap_all = swap[, -newpool_lanes]
#drop spikes, should be same level in all wells --> no net swapping
noswap = noswap[grepl("ENSMUSG", rownames(noswap)), -newpool_lanes]
swap = swap[grepl("ENSMUSG", rownames(swap)), -newpool_lanes]
#split into plates
blood_meta = data.frame(cell = colnames(swap))
split_1 = strsplit(as.character(blood_meta$cell), split = ".", fixed = T)
blood_meta$plate = sapply(split_1, function(x) x[2])
split_2 = strsplit(sapply(split_1, function(x) x[3]), split = "_")
blood_meta$column = sapply(split_2, function(x) x[1])
blood_meta$row = sapply(split_2, function(x) x[2])
swap_split = lapply(split(as.data.frame(t(swap)), blood_meta$plate), function(x) t(as.matrix(x)))
noswap_split = lapply(split(as.data.frame(t(noswap)), blood_meta$plate), function(x) t(as.matrix(x)))
To confirm the existance of barcode swapping in multiple datasets, we have used data from a published study (Nestorowa et al. 2016), which we refer to as the Nestorowa data. Each of 16 plates of single-cell libraries was pooled and sequenced in a single lane on a HiSeq 2500, and at a later date the same pooled libraries were sequenced on a HiSeq 4000. Because exactly the same pools of libraries were used, the only differences between the results should be caused by the sequencing machines and Poisson sampling noise (Marioni et al. 2008). It is important that the libraries were not repooled, as this would likely induce systematic differences in the proportions that make up the pool, which may confound results.
First, we identified the gene with the most read counts in a single cell in the first plate of cells - Igkc. This gene is almost uniquely expressed in one cell in the plate. On both machines, we observe the same pattern of “crosshair” swapping (along the row and column of a highly expressing cell) as reported in Sinha et al. (2017) shown in Figures 10 and 11.
#small function to get the fraction of counts on the whole plate
#in the crosshair of the maximally expressing cell
get_fraction = function(gene, meta, counts){
expr = counts[gene,]
cell = names(expr)[which.max(expr)]
bc1 = meta$column[meta$cell == cell]
bc2 = meta$row[meta$cell == cell]
cells = as.character(meta$cell[meta$column == bc1 | meta$row == bc2])
cells = cells[-which(cells == cell)]
return(sum(expr[cells])/expr[cell])
}
#generates a crosshair-like plot
plate_plot = function(gene, meta, counts, log_transform = T, plot = F){
if(nrow(meta)!=ncol(counts)){
stop("meta/counts not same size")
}
if(log_transform){
expr = log2(counts[gene,] + 1 )
} else {
expr = counts[gene, ]
}
mat = matrix(NA,
ncol = length(unique(meta$column)),
nrow = length(unique(meta$row)),
dimnames = list(
unique(meta$row)[order(unique(meta$row))],
unique(meta$column)[order(unique(meta$column))]
))
for(i in 1:length(expr)){
mat[meta$row[i], meta$column[i]] = expr[i]
}
# pheatmap(mat, cluster_rows = F, cluster_cols = F, display_numbers = T,
# color = colorRampPalette(brewer.pal(n = 7, name ="Spectral"))(100),
# breaks = seq(from = 0, to = max(mat, na.rm = T) + 1e-8, length.out = 101))
melt = melt(mat)
plot_out = ggplot(melt, aes(x = Var2, y = Var1)) +
geom_tile(aes(fill = value), color = "white") +
scale_fill_gradientn(colors = c("grey70", "cornflowerblue", "black"), name = "log2(count+1)") +
# scale_fill_viridis(name = "log2(count+1)") +
annotate(geom = "text", x = melt$Var2, y = melt$Var1, label = format(melt$value, digits = 2), color = "white") +
labs(x = "", y = "")
if(plot){
print(plot_out)
}
return(list(matrix = mat, plot = plot_out))
}
plate_2500 = plate_plot(gene = rownames(noswap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = noswap_split[[1]])
print(plate_2500[[2]] + ggtitle("HiSeq 2500"))
Figure 10: Crosshair pattern on HiSeq 2500
The highly expressed gene at position (N708, S508) is seen to increase the read counts for cells that share a single barcode. The experimental design used in the experiment does not promote this behaviour.
frac_2500 = get_fraction(gene = rownames(noswap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = noswap_split[[1]])
plate_4000 = plate_plot(gene = rownames(swap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = swap_split[[1]])
print(plate_4000[[2]] + ggtitle("HiSeq 4000"))
Figure 11: Crosshair pattern on HiSeq 4000 The highly expressed gene at position (N708, S508) is seen to increase the read counts for cells that share a single barcode
The experimental design used in the experiment does not promote this behaviour.
frac_4000 = get_fraction(gene = rownames(swap_split[[1]])[row(noswap_split[[1]])[which.max(noswap_split[[1]])]],
meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),],
counts = swap_split[[1]])
It is clear that the pattern is stronger on the HiSeq 4000. There are 1.93% as many Igkc reads in the crosshair as in the central highly expressing cell in the 4000 data, compared to 0.257% for the HiSeq 2500. This is consistent with the approximately order of magnitude difference identified in the swapping rate between the two technologies from the Richard data.
We can show the divergence between the machines more clearly by considering the difference between the log2 counts, shown in Figure 12.
delta = plate_4000[[1]] - plate_2500[[1]]
melt = melt(delta)
plot_out = ggplot(melt, aes(x = Var2, y = Var1)) +
geom_tile(aes(fill = value), color = "white") +
# scale_fill_gradientn(colors = c("coral" ,"grey70", "cornflowerblue", "black"), name = "delta log2(count+1)", values = rescale(c(min(melt$value), 0, max(melt$value)/2, max(melt$value)))) +
scale_fill_gradientn(colors = c("royalblue3" ,"grey80", "indianred3"), name = "delta log2(count+1)", values = rescale(c(min(melt$value), 0, max(melt$value)))) +
annotate(geom = "text", x = melt$Var2, y = melt$Var1, label = format(melt$value, digits = 2), color = "white") +
labs(x = "", y = "") +
ggtitle("HiSeq 4000 - HiSeq 2500")
plot_out
Figure 12: Sequencing machine crosshair differences
The difference between the log2 counts of the gene Igkc is shown between the two sequencing machines - more reads are concentrated in the crosshair on the HiSeq 4000 than the HiSeq 2500, which is suggestive of swapping.
We applied a model that identifies contributions of different cells in the HiSeq 2500 data (treating this as a baseline, low-swapping system (Sinha et al. 2017)) to the swapping-affected transcriptomes of the HiSeq 4000 data. This will allow quantification of an overall swapping rate across multiple genes.
Let \(C^{4000}_{g \times c}\) denote the count matrix for the HiSeq 4000 libraries, where position (\(g_i\), \(c_j\)) corresponds to the expression level of gene \(i\) in cell \(j\). Let \(C^{2500}_{g \times c}\) denote the count matrix for the HiSeq 2500 libraries. Let \(M_{c \times c}\) denote a matrix capturing the contribution of the HiSeq 2500 counts to the HiSeq 4000 counts. Each entry of \(M_{c \times c}\) defines the proportion of each HiSeq 2500 library that makes up each HiSeq 4000 library.
More specifically, if \(c\) indexes the columns of \(M\) (\(1 \leq c \leq cell\)) and \(r\) indexes the rows (\(1 \leq r \leq cell\)), then \((r, c)\) of \(M_{c\times c}\) represents the contribution of cell \(r\)’s expression profile in the HiSeq 2500 data to the expression profile of cell \(c\) in the HiSeq 4000 data. For a cell with a given pair of barcodes, we want to discriminate between the contribution of other cells with exactly one shared barcode and other cells with no shared barcodes to the expression profile of the cell of interest. To do this, we define the elements of the matrix \(M_{r,c}\) as:
\[ M_{r,c} = \begin{cases} \alpha & \text{if cell $c$ and cell $r$ share both barcodes},\\ \beta & \text{if cell $c$ and cell $r$ share exactly one barcode},\\ \gamma & \text{if cell $c$ and cell $r$ share neither barcodes}. \end{cases} \]
Subsequently, let:
\[ C^{4000}_{g \times c} = C^{2500}_{g \times c} M_{c \times c} + \epsilon\]
where \(\epsilon\) represents the residual error. The aim is to obtain estimates of \(\alpha\), \(\beta\), and \(\gamma\) for each plate, using information across many genes to stabilise the estimates.
Gene selection is important for the fitting of this model. We look at the expression of two genes on the first HiSeq 2500 plate in the dataset in Figures 13 and 14
rowMax = function(mat){
return(apply(mat, 1, max))
}
#pull out some genes for demo of selection
gene_vars = apply(noswap_split[[1]][rowMax(noswap_split[[1]])>500,], 1, function(x) max(x)/quantile(x, 0.9))
gene_vars = gene_vars[order(gene_vars, decreasing = TRUE)]
plot1 = plate_plot(gene = names(gene_vars[30]), meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),], counts = noswap_split[[1]], log_transform = TRUE, plot = FALSE)
print(plot1[[2]] +ggtitle("Gene A"))
Figure 13: Gene expression pattern of gene A
The expression of gene A is shown in log-counts for cells on the first plate of this dataset.
plot2 = plate_plot(gene = names(gene_vars[1000]), meta = blood_meta[blood_meta$cell %in% colnames(noswap_split[[1]]),], counts = noswap_split[[1]], log_transform = TRUE, plot = FALSE)
print(plot2[[2]] + ggtitle("Gene B"))
Figure 14: Gene expression pattern of gene B
The expression of gene A is shown in log-counts for cells on the first plate of this dataset. Gene B is expressed widely across the plate.
Both genes are expressed in only a subset of the cells on the plate, but gene B is expressed much more broadly than gene A. This poses a problem for the deconvolution model: for gene B, it is harder for the model to distinguish between contributions from other cells on the row and column of a certain cell (\(\beta\)), and from all the other cells on the plate (\(\gamma\)), because many cells in both of these sets express the gene at high levels. By contrast, gene A is expressed at a high level in only very few cells; it is therefore easier for the model to distinguish between large values of \(\beta\) and \(\gamma\). Genes that are expressed broadly in the manner of gene B are therefore less informative than genes like gene A for the model fit.
In order to capture the most informative genes, we first subset by expression levels: genes are retained if they are present at a minimum level of 500 counts in at least one cell in the HiSeq 2500 data. It is important to have a large number of transcripts present, otherwise swapping may be too rare to detect.
Then, to find genes like gene A, we rank genes according to their maximum expression on the plate divided by the 90th percentile. This value will be highest when only a very few cells are highly expressing a gene.
These scores are plotted in decreasing order for a few plates in Figure 15:
set.seed(42)
plates = sample(length(noswap_split), 4)
gene_vars = lapply(plates, function(i)
apply(noswap_split[[i]][rowMax(noswap_split[[i]])>500,], 1, function(x) max(x)/quantile(x, 0.9)))
gene_vars = lapply(gene_vars, function(x) x[order(x, decreasing = TRUE)])
gene_vars = lapply(gene_vars, function(x) x[x!=Inf])
plate_ids = lapply(1:length(gene_vars), function(x) rep(plates[x], length(gene_vars[[x]])))
indices = lapply(gene_vars, function(x) 1:length(x))
plot_df = data.frame(score = unlist(gene_vars), plate = paste("plate", unlist(plate_ids)), index = unlist(indices))
ggplot(plot_df, aes(y = score, x = index)) +
geom_point(size = 0.5, col = "darkgrey") +
facet_wrap(~plate, nrow = 2) +
scale_y_log10(breaks = c(10, 100, 1000), labels = c("10", "100", "1,000"), name = "max / 90% quantile") +
theme_bw() +
lims(x = c(1, 1000))
Figure 15: Informative genes for four plates
For four plates, chosen at random from the dataset, genes are ranked in descending order by ratio of their maximum expression on the plate and the 90% quantile expression. Genes are plotted in descending order, with infinite values of the ratio excluded from plots, but used in the models (there are at most 33 such ratios in one plate). Only the first 1000 genes are shown.
In these plates, the top 250 genes contain most of the informative genes. We consider 500 genes to ensure that we include the infinite ratios (where the 90th percentile is 0) and do not exclude informative genes for other plates than those shown above, which may have longer tails to the distribution shown. We have also run fits considering both 250 genes and 1000 genes, and obtain similar parameter estimates (shown below).
Notably, the maximum expression value on a plate does not substantially drive selection of genes, shown for two plates in Figure 16.
vars_1 = apply(noswap_split[[1]][rowMax(noswap_split[[1]])>500,], 1, function(x) max(x)/quantile(x, 0.9))
maxs_1 = rowMax(noswap_split[[1]][rowMax(noswap_split[[1]])>500,])
keep_1 = vars_1 >= vars_1[order(vars_1, decreasing = TRUE)][500]
vars_2 = apply(noswap_split[[2]][rowMax(noswap_split[[2]])>500,], 1, function(x) max(x)/quantile(x, 0.9))
maxs_2 = rowMax(noswap_split[[2]][rowMax(noswap_split[[2]])>500,])
keep_2 = vars_2 >= vars_2[order(vars_2, decreasing = TRUE)][500]
plot_df = data.frame(score = c(vars_1, vars_2), max = c(maxs_1, maxs_2), plate = c(rep("Plate 1", length(vars_1)), rep("Plate 2", length(vars_2))), keep = c(keep_1, keep_2))
plot_df = plot_df[plot_df$score!= Inf,]
ggplot(plot_df, aes(x = max, y = score, col = factor(keep))) +
geom_point() +
facet_grid(. ~ plate) +
scale_x_log10(name = "Maximum gene expression count", breaks = 10^(3:5), labels = c("1,000", "10,000", "100,000")) + scale_y_log10("Max/90% score") +
scale_color_manual(labels = c("Dropped gene", "Analysed gene"), name = "", values = c("darkgrey", "coral"))
Figure 16: Effect of expression magnitude on gene selection
The ratio we use to select genes is not driven strongly by the size of the numerator: we are not simply selecting for genes with the highest expression in one cell.
The model fit used Poisson precision weights: specifically, counts (both response \(C^{4000}\) and predictors \(C^{2500}\)) were multiplied by the reciprocal of the square-root of their gene’s mean expression, to prevent the most highly expressed genes dominating the least-squares fit. A constrained linear inverse model was used for the fit to avoid obtaining negative values of \(\alpha\), \(\beta\), and \(\gamma\) (using the package lsei). Gene subsetting and fitting of the model was undertaken separately for each plate of cells, so each plate uses its own informative gene set.
#This chunk is particularly slow - perhaps run on only a subset of plates if you want speed
#we needed to add a line to the source code of this function to avoid
#an overflow issue in fortran
#The line just divides all the values of a matrix by a constant value
#otherwise the function is unchanged from the package lsei
my_lsei = function (A = NULL, B = NULL, E = NULL, F = NULL, G = NULL,
H = NULL, Wx = NULL, Wa = NULL, type = 1, tol = sqrt(.Machine$double.eps),
tolrank = NULL, fulloutput = FALSE, verbose = TRUE)
{
require(quadprog)
if (is.vector(E) & length(F) == 1)
E <- t(E)
else if (!is.matrix(E) & !is.null(E))
E <- as.matrix(E)
if (is.vector(A) & length(B) == 1)
A <- t(A)
else if (!is.matrix(A) & !is.null(A))
A <- as.matrix(A)
if (is.vector(G) & length(H) == 1)
G <- t(G)
else if (!is.matrix(G) & !is.null(G))
G <- as.matrix(G)
if (!is.matrix(F) & !is.null(F))
F <- as.matrix(F)
if (!is.matrix(B) & !is.null(B))
B <- as.matrix(B)
if (!is.matrix(H) & !is.null(H))
H <- as.matrix(H)
if (is.null(A) && is.null(E)) {
if (is.null(G))
stop("cannot solve least squares problem - A, E AND G are NULL")
A <- matrix(data = 0, nrow = 1, ncol = ncol(G))
B <- 0
}
else if (is.null(A)) {
A <- matrix(data = E[1, ], nrow = 1)
B <- F[1]
}
Neq <- nrow(E)
Napp <- nrow(A)
Nx <- ncol(A)
Nin <- nrow(G)
if (is.null(Nx))
Nx <- ncol(E)
if (is.null(Nx))
Nx <- ncol(G)
if (is.null(Nin))
Nin <- 1
if (is.null(Neq)) {
Neq <- 0
if (verbose & type == 1)
warning("No equalities - setting type = 2")
type = 2
F <- NULL
}
else {
if (ncol(E) != Nx)
stop("cannot solve least squares problem - A and E not compatible")
if (length(F) != Neq)
stop("cannot solve least squares problem - E and F not compatible")
}
if (is.null(G))
G <- matrix(data = 0, nrow = 1, ncol = Nx)
if (is.null(H))
H <- 0
if (ncol(G) != Nx)
stop("cannot solve least squares problem - A and G not compatible")
if (length(B) != Napp)
stop("cannot solve least squares problem - A and B not compatible")
if (length(H) != Nin)
stop("cannot solve least squares problem - G and H not compatible")
if (!is.null(Wa)) {
if (length(Wa) != Napp)
stop("cannot solve least squares problem - Wa should have length = number of rows of A")
A <- A * Wa
B <- B * Wa
}
Tol <- tol
if (is.null(Tol))
Tol <- sqrt(.Machine$double.eps)
IsError <- FALSE
if (type == 1) {
ineq <- Nin + Nx
mIP <- ineq + 2 * Nx + 2
lpr <- 1
if (fulloutput)
lpr <- lpr + 3
if (!is.null(tolrank))
lpr <- lpr + 6
if (!is.null(Wx)) {
lw <- length(Wx)
lpr <- lpr + 2 + lw
}
ProgOpt <- rep(1, lpr)
if (lpr > 1) {
ipr <- 1
if (fulloutput) {
ProgOpt[ipr:(ipr + 2)] <- c(ipr + 3, 1, 1)
ipr <- ipr + 3
}
if (!is.null(tolrank)) {
if (length(tolrank) == 1)
tolrank <- rep(tolrank, len = 2)
ProgOpt[ipr:(ipr + 5)] <- c(ipr + 6, 4, tolrank[1],
ipr + 6, 5, tolrank[2])
ipr <- ipr + 6
}
if (!is.null(Wx)) {
lw <- length(Wx)
if (lw == 1) {
ProgOpt[ipr:(ipr + 2)] <- c(ipr + 3, 2, 1)
}
else {
if (lw != Nx)
stop("cannot solve least squares problem - number of weighs should be =1 or =number of unknowns")
lw <- lw + ipr + 1
ProgOpt[ipr:lw] <- c(lw + 1, 3, Wx)
}
}
}
mdW <- Neq + Napp + ineq
if (fulloutput)
mdW <- max(mdW, Nx)
mWS <- 2 * (Neq + Nx) + max(Napp + ineq, Nx) + (ineq +
2) * (Nx + 7)
storage.mode(A) <- storage.mode(B) <- "double"
storage.mode(E) <- storage.mode(F) <- "double"
storage.mode(G) <- storage.mode(H) <- "double"
sol <- .Fortran("lsei", NUnknowns = Nx, NEquations = Neq,
NConstraints = Nin, NApproximate = Napp, A = A,
B = B, E = E, F = F, G = G, H = H, X = as.vector(rep(0,
Nx)), mIP = as.integer(mIP), mdW = as.integer(mdW),
mWS = as.integer(mWS), IP = as.integer(rep(0, mIP)),
W = as.double(matrix(data = 0, nrow = mdW, ncol = Nx +
1)), WS = as.double(rep(0, mWS)), lpr = as.integer(lpr),
ProgOpt = as.double(ProgOpt), verbose = as.logical(verbose),
IsError = as.logical(IsError))
if (any(is.infinite(sol$nX)))
sol$IsError <- TRUE
if (fulloutput) {
covar <- matrix(data = sol$W, nrow = mdW, ncol = Nx +
1)[1:Nx, 1:Nx]
RankEq <- sol$IP[1]
RankApp <- sol$IP[2]
}
}
else if (type == 2) {
if (!is.null(Wx))
stop("cannot solve least squares problem - weights not implemented for type 2")
if (!is.null(Wa))
stop("cannot solve least squares problem - weights not implemented for type 2")
dvec <- crossprod(A, B)
Dmat <- crossprod(A, A)
diag(Dmat) <- diag(Dmat) + 1e-08
#THIS IS THE ADDED CODE
#As per answer 2 at http://stackoverflow.com/questions/28381855/r-function-solve-qp-error-constraints-are-inconsistent-no-solution?rq=1 it appears we have overflow in the fortran code, so we scale as recommended
sc = base::norm(Dmat, "2")
Dmat = Dmat/sc
dvec = dvec/sc
Amat <- t(rbind(E, G))
bvec <- c(F, H)
sol <- solve.QP(Dmat, dvec, Amat, bvec, meq = Neq)
sol$IsError <- FALSE
sol$X <- sol$solution
}
else stop("cannot solve least squares problem - type unknown")
X <- sol$X
X[which(abs(X) < Tol)] <- 0
if (any(is.infinite(X))) {
residual <- Inf
solution <- Inf
}
else {
residual <- 0
if (Nin > 0) {
ineq <- G %*% X - H
residual <- residual - sum(ineq[ineq < 0])
}
if (Neq > 0)
residual <- residual + sum(abs(E %*% X - F))
if (residual > Tol)
sol$IsError <- TRUE
solution <- 0
if (Napp > 0)
solution <- sum((A %*% X - B)^2)
}
xnames <- colnames(A)
if (is.null(xnames))
xnames <- colnames(E)
if (is.null(xnames))
xnames <- colnames(G)
names(X) <- xnames
res <- list(X = X, residualNorm = residual, solutionNorm = solution,
IsError = sol$IsError, type = "lsei")
if (fulloutput && type == 1) {
res$covar <- covar
res$RankEq <- sol$IP[1]
res$RankApp <- sol$IP[2]
}
return(res)
}
#Function written largely by Aaron Lun
#Key function to estimating the contributions
# ARGS:
# plate_XXXX is the counts matrix for the cells to be compared
# bc1, bc2 are vectors for the barcodes for the cells in each of the plates
# constrain == TRUE uses a constrained fit, all coef > 0
# n.genes is the number of genes to select (see above for explanation)
# RETURN:
# vector of coefficients, "other" is cells that share no barcodes, "rowcol" is cells that share one barcode
# "self" is the cell that shares 2 barcodes (i.e. itself.)
get_mixing_matrix = function(plate_4000, plate_2500, bc1, bc2, constrain = FALSE, n.genes = 500){
nsamples = ncol(plate_4000)
#Gene selection
if(nrow(plate_4000)>1){
#straight SD version
# top_var = apply(plate_2500, 1, sd)
#As a fraction of the 0.9 quantile
#logic being that these are the genes that show the "crosshair" patterns
top_var = apply(plate_2500, 1, function(x) max(x)/quantile(x, 0.9))
#Max expression vs. second highest cell
# top_var = apply(plate_2500, 1, function(x) max(x)/max(x[-which.max(x)]))
# impose an expression limit - the gene must be expressed in a cell with at least 500 counts
# (otherwise some of the methods above pick up genes expressed with, say,
# one or two counts on the whole plate, where swapping cannot actually happen)
top_var = top_var[apply(plate_2500, 1, max)>500]
if(n.genes > length(top_var)){
top_genes = names(top_var)
warning(paste("Asked for more genes than qualified by abundance; instead using all", length(top_var), "abundant genes"))
} else {
top_genes = names(top_var)[order(top_var, decreasing = TRUE)][1:n.genes]
}
#perform subsetting
mixed <- plate_4000[top_genes,]
pure <- plate_2500[top_genes,]
} else {
#if only one gene was supplied we don't subset
mixed = plate_4000
pure = plate_2500
}
#establish relationships between cells RE whether they share a barcode
beta = matrix(NA, ncol(mixed), ncol(mixed))
shares_barcode = sapply(1:length(beta), function(x) grepl(bc1[row(beta)[x]], colnames(mixed)[col(beta)[x]]) | grepl(bc2[row(beta)[x]], colnames(mixed)[col(beta)[x]]))
beta = matrix(as.numeric(shares_barcode), nrow = ncol(mixed), ncol = ncol(mixed), dimnames = list(colnames(mixed), colnames(mixed)))
diag(beta) = 2
beta = beta+1
#now the parameters we are fitting are: 1 - non row/col; 2 - row/col; 3 - self
# Formulation of C_{4000} = M C_{2500} is "backwards" for a linear model - we need to solve for M, not C_{2500}
# Therefore we formulate a new design matrix in order to solve M - this section of code does this
# We have a complex set of contribution comparisons to make:
# Rows are the possible gene-cell combinations
# Columns are the sites-of-origin of counts i.e. col1 - shares no barcodes, col2 - shares one barcode, col3 - same cell
nbeta <- max(beta)
design <- responses <- vector("list", nrow(mixed))
for (g in seq_len(nrow(mixed))) {
collected.i <- collected.j <- collected.x <- vector("list", nsamples)
for (i in seq_len(nsamples)) {
collected.i[[i]] <- rep(i, nsamples)
collected.j[[i]] <- beta[i,]
collected.x[[i]] <- pure[g,]
}
collected <- sparseMatrix(i=unlist(collected.i), j=unlist(collected.j), x=unlist(collected.x))
#staying with Matrix here leads to an integer overflow of some sort
#coerce back to regular matrix
design[[g]] <- as.matrix(collected)
responses[[g]] <- mixed[g,]
}
design <- do.call(rbind, design)
#downweight by abundance, to ensure it is not only a few highly expressed genes driving a fit
if(nrow(mixed)>1){
gene_means = rowMeans(pure)
sqrts = sqrt(gene_means)
repped = unlist(lapply(sqrts, rep, ncol(mixed)))
design = sweep(design, 1, repped, "/")
responses = Matrix(unlist(responses))/repped
}
if(!constrain){
res = solve(qr(design), matrix(unlist(responses)))
out = res[,1]#/sum(res)
} else {
#this implementation of lsei prevents fortran overflows but is otherwise identical
res = my_lsei(A = as.matrix(design), B = unlist(responses), G = diag(ncol(design)), H = numeric(ncol(design)), type = 2)
out = res$X#/sum(res$X)
}
names(out) = c("other", "rowcol", "self")
return(out)
}
#utility function to extract barcodes from cell names
get_bcs = function(counts_mat){
bcs = strsplit(colnames(counts_mat), ".", fixed = T)
bc1 = sapply(strsplit(sapply(bcs, function(x) x[3]), "_"), function(y) y[1])
bc2 = sapply(strsplit(sapply(bcs, function(x) x[3]), "_"), function(y) y[2])
return(list(bc1, bc2))
}
bcs = lapply(swap_split, get_bcs)
#get the parameters
const = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 500))
#rbind them together for all the plates
const = do.call(rbind, const)
#multiply up to total contributions
const_adjusted = sweep(const, 2, c(96-8-11, 11+7, 1), "*")
#normalise to one
const_adjusted = sweep(const_adjusted, 1, rowSums(const_adjusted), "/")
run_250 = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 250))
run_250 = do.call(rbind, run_250)
adjust_250 = sweep(run_250, 2, c(96-8-11, 11+7, 1), "*")
adjust_250 = sweep(adjust_250, 1, rowSums(adjust_250), "/")
run_1000 = lapply(1:length(swap_split), function(i) get_mixing_matrix(swap_split[[i]], noswap_split[[i]], bc1 = bcs[[i]][[1]], bc2 = bcs[[i]][[2]], constrain = TRUE, n.genes = 1000))
run_1000 = do.call(rbind, run_1000)
adjust_1000 = sweep(run_1000, 2, c(96-8-11, 11+7, 1), "*")
adjust_1000 = sweep(adjust_1000, 1, rowSums(adjust_1000), "/")
Across the 16 plates assayed, we acquired a distribution of fitted values (0 to 0.00279) for the single shared barcode contribution term.
To quantify how much of a cell’s transcriptome derives from all other cells that share exactly one or exactly no barcodes, we multiplied \(\beta\) and \(\gamma\) by 18 and 77, respectively. Here, 18 and 77 correspond to the number of entries in each column (or row) of \(M\) that are \(\beta\) or \(\gamma\), respectively. In other words, they are for each cell the number of other cells on the plate that share exactly one or no barcodes.
We then normalised the estimates to ensure that \(\alpha + \beta + \gamma = 1\), as we wish to examine the fraction of HiSeq 4000 counts that are derived from other cells. Across plates, the mean contributions from the single barcode sharing cells (i.e. \(\beta\)) was estimated to be 2.068 ± 0.326% (Figure 17).
Similar results were also obtained with 250 (2.06 ± 0.298%) or 1000 (2.04 ± 0.372%) genes considered per plate.
hist(const_adjusted[,2], breaks = 20, xlim = c(0, 0.06), main = "", xlab = "Fitted fraction of transcriptome from other barcode-sharing cells")
Figure 17: Fitted values for single-barcode-sharing contributions
The contributions for cells that share one barcode are shown above, where replicates are for different plates of single cells. Different fitted values may derive from truly different swapping rates, or differences in the number of informative genes.
These values provide an estimated increase in swapping rate in the HiSeq 4000 data compared to the HiSeq 2500 data, not an absolute estimate. Nonetheless, these values are consistent with the rate of swapping calculated in the Richard dataset above. Assuming that HiSeq 2500 swapping occurs at a rate of 1/10th the rate of HiSeq 4000 swapping (based on Richard data, and the crosshair patterns shown in Figures 10 and 11), we could estimate an absolute HiSeq 4000 swapping rate in this data by multiplication of the HiSeq 4000 estimate by a factor of 1.1. This gives an estimate from the Nestorowa data of 2.275 ± 0.359%, which is consistent with the absolute swapping rate estimated from the Richard data of 2.187±0.0765%.
Contributions for the cells that share no barcodes (i.e. using \(\gamma\)) are shown in Figure 18. Despite there being many more cells that could contribute in this manner, the total contribution is low, and in many cases estimated to be 0.
hist(const_adjusted[,1], breaks = 20, xlim = c(0, 0.06), main = "", xlab = "Fitted fraction of transcriptome from non-barcode-sharing cells")
Figure 18: Fitted values for single-barcode-sharing contributions
The contributions for cells that share one barcode are shown above, where replicates are for different plates of single cells. Different fitted values may derive from truly different swapping rates, or differences in the number of informative genes.
In the crosshair plots shown in Figures 10 and 11, we showed a swapping pattern for a single gene for a single plate. We apply a model to each gene across all plates to identify whether swapping is happening transcriptome-wide. This differs from the previous model, which considered many genes together.
Let \(i\) denote the row index for a cell on the 96 well plate (\(1 \leq i \leq 8\)) and let \(j\) denote a cell’s column index (\(1 \leq j \leq 12\)). Subsequently, let \(C_{i,j,g}^{(t)}\) denote the read count of gene \(g\) in cell \((i,j)\) when profiled using sequencing technology \(t\), where \(t=2500\) refers to the HiSeq 2500 and \(t=4000\) refers to the HiSeq 4000. Additionally, let \[S_{i,j,g}^{(t)} = \sum_{k=1}^{12} C_{i,k,g}^{(t)} + \sum_{l=1}^{8} C_{l,j,g}^{(t)} - 2 C_{i,j,g}^{(t)}\] represent the total read count of gene \(g\) in technology \(t\) across all cells with index \(i\) or \(j\), excluding the cell with barcode combination \((i,j)\).
We first consider a null model without swapping where, for cell \((i,j)\), the number of reads mapped to each gene \(g\) in the HiSeq 4000 data set is only dependent on the number of reads mapped to the same cell in the HiSeq 2500 data. More specifically:
\[H_{0}: C_{i,j,g}^{4000} = \alpha_{g} C_{i,j,g}^{2500} + \epsilon_{i,j,g}\]
\(\alpha_g\) captures both read depth differences as well as any gene specific biases (e.g. GC content) that may change between the two sequencing machines.
As an alternative, we also consider an alternative model with a swapping term, where each cell in the HiSeq 4000 data receives and loses reads from swapping with cells that share exactly one barcode with it. Specifically, let:
\[H_{1}: C_{i,j,g}^{4000} = \alpha_g C_{i,j,g}^{2500} + \beta_g \left( S_{i,j,g}^{2500} - C_{i,j,g}^{2500} \right) + \epsilon_{i,j,g}\]
\(\beta_g\) allows the strength of the swapping term to vary between genes, so that we can identify whether all genes are affected by swapping or not. \(S_{i,j,g}^{2500}\) accounts for arrival of transcripts of gene \(g\) in a cell \((i,j)\) due to barcode swapping. Similarly, \(C_{i,j,g}^{2500}\) accounts for the loss of transcripts of gene \(g\) in cell \((i,j)\) due to barcode swapping (as without a barcode swap, a PCR duplicate would instead represent a new read for the gene in the original cell).
The model may also be parameterised as \[H_{1}: C_{i,j,g}^{4000} = \left(\alpha_g - \beta_g\right) C_{i,j,g}^{2500} + \beta_g S_{i,j,g}^{2500} + \epsilon_{i,j,g} = \alpha'_g C_{i,j,g}^{2500} + \beta_g S_{i,j,g}^{2500} + \epsilon_{i,j,g}\] but as the models are equivalent we fit the more swapping intuitive formulation presented above (i.e. including gain and loss of trancripts).
Models \(H_1\) and \(H_0\) are fitted by least squares, approximating the Poisson-distributed counts as normally distributed. We fitted to genes that are abundant (mean >50 counts across all plates) as swapping can only occur where there are sufficient molecules present to swap in the first place.
For each gene, evidence against \(H_{0}\) in favour of \(H_{1}\) can be established by performing an F-test (as the models are nested).
When \(H_{1}\) is favoured over \(H_{0}\), we are explaining the data better with the swapping term. However, the swapping term only operates in the way that we expect when \(\beta > 0\). If \(\beta < 0\), then the behaviour of swapping is inverted compared to that expected from the other work presented above.
When the model is applied, the vast majority of genes favour \(H_{1}\), with \(\beta>0\), as shown in Figure 19
#This chunk is very slow and computationally intensive. Change the min.mean setting in the line starting
# "gene_dfs = " if you need to speed it up.
#Note that these functions rely on the barcodes being in the names of the cells in a specific place
#returns the summed counts for all cells that share an index from the cell name, in the counts.
#char cell: cell name
#named vector named_count_row: a gene's row from the counts matrix, with cell names
get_summed = function(cell, named_count_row){
barcodes = strsplit(strsplit(cell, split = ".", fixed = TRUE)[[1]][3],
split = "_")[[1]]
row_bc = barcodes[2]
col_bc = barcodes[1]
#provides S_{gij} for the simplified model fit
# summed = sum(named_count_row[grepl(row_bc, names(named_count_row)) | grepl(col_bc, names(named_count_row))], na.rm = TRUE) - 2* named_count_row[names(named_count_row) == cell]
#provides S_{gij} - C_{gij} for the more complex model fit
#-2C for the non-self-counting, -1C for the term to fit beta to
summed = sum(named_count_row[grepl(row_bc, names(named_count_row)) | grepl(col_bc, names(named_count_row))], na.rm = TRUE) - 3* named_count_row[names(named_count_row) == cell]
return(summed)
}
#given a row of counts for each of the 2500 and 4000 (i.e. a gene), function prepares
#a data frame of cell, count_4000 = 4000 counts, count_2500 = 2500 counts, summed_2500 = row/col summed counts
# i.e. wraps get_summed() across cells
process_gene_on_plate = function(count_row_2500, count_row_4000){
summed = sapply(1:length(count_row_2500), function(x) get_summed(cell = names(count_row_2500[x]),
named_count_row = count_row_2500))
out = data.frame(cell = names(count_row_2500),
count_4000 = count_row_4000,
count_2500 = count_row_2500,
summed_2500 = summed)
return(out)
}
#Given plate-specific data frames, returns a data frame for each gene therein with,
# for each cell, the values of the 2500 and 4000 counts, and the summed 2500 row/column
#i.e. wraps process_gene_on_plate() across genes
process_plate = function(plate_meta, plate_4000, plate_2500){
if(any(rownames(plate_2500)!= rownames(plate_4000)))
stop("Genes not the same on both plates")
genes_4000 = split(as.data.frame(plate_4000), rownames(plate_4000))
genes_4000 = lapply(genes_4000, function(x) unlist(x[1,]))
genes_2500 = split(as.data.frame(plate_2500), rownames(plate_2500))
genes_2500 = lapply(genes_2500, function(x) unlist(x[1,]))
gene_dfs = lapply(1:length(genes_2500), function(x)
process_gene_on_plate(count_row_2500 = genes_2500[[x]],
count_row_4000 = genes_4000[[x]])
)
names(gene_dfs) = names(genes_2500)
return(gene_dfs)
}
#The function takes the full counts matrices, splits them by plate, and returns the
#dataframes ready to be modelled from, in a list
#all_meta needs $plate
#i.e. wraps process_plate() across all plates
do_all_plates = function(counts_4000, counts_2500, all_meta, min.mean = 50, n.cores = 3){
require(parallel)
keep_genes = which(rowMeans(counts_2500)>min.mean)
split_meta = split(all_meta, all_meta$plate)
split_4000 = lapply(split(as.data.frame(t(counts_4000[keep_genes,])), all_meta$plate), function(x) t(as.matrix(x)))
split_2500 = lapply(split(as.data.frame(t(counts_2500[keep_genes,])), all_meta$plate), function(x) t(as.matrix(x)))
#run over all plates
#Critically, this returns the column summed_2500, which is defined by get_summed()
# and will later be the predictor
dataframes = mclapply(seq_along(split_4000), function(x) process_plate(plate_meta = split_meta[[x]],
plate_4000 = split_4000[[x]],
plate_2500 = split_2500[[x]]),
mc.cores = n.cores)
#for each gene,make a master data frame
n.genes = length(dataframes[[1]])
gene_list = list()
for(gene in 1:n.genes){
#get the combined dataframe of the gene
gene_df = data.frame()
for(plate in 1:length(dataframes)){
gene_df = rbind(gene_df, dataframes[[plate]][[gene]])
}
gene_df$response = gene_df$count_4000
gene_df$predictor = gene_df$summed_2500
#sometimes we cannot get a predictor variable i.e. there is not any possibiltiy for swapping when
#the row/col sum= 0 (i.e. the gene does not exist for these cells)
#so we remove these now
# gene_df = gene_df[!is.nan(gene_df$predictor),]
gene_list[[length(gene_list)+1]] = gene_df
names(gene_list)[[length(gene_list)]] = names(dataframes[[1]])[gene]
}
return(gene_list)
}
#This function just makes models from its input list - from do_all_plates()
get_model_from_df_list = function(df_list){
models = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500 + x$predictor))
return(models)
}
#takes the same list as get_model_from_df_list() but does the model comparison
get_anovas = function(df_list){
full_mod = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500 + x$predictor))
small_mod = lapply(df_list, function(x) lm(x$response ~ 0 + x$count_2500))
return(lapply(1:length(df_list), function(i) anova(small_mod[[i]], full_mod[[i]])))
}
#If you want to speed up this code, increase the min.mean to 500 or even higher, so fewer genes are considered
gene_dfs = do_all_plates(counts_4000 = swap, counts_2500 = noswap, all_meta = blood_meta, min.mean = 50)
#for betas
models = get_model_from_df_list(gene_dfs)
anovas = get_anovas(gene_dfs)
#an alternative method vs. the Wald test
anova_ps = sapply(anovas, function(x) x$`Pr(>F)`[2])
anova_fdr = p.adjust(anova_ps, method = "fdr")
anova_frac = prop.table(table(p.adjust(anova_ps, method = "fdr")<0.05))[2]
gradients = sapply(models, function(x) coef(x)[2])
plot_df = data.frame(val = c("pos", "pos", "neg", "neg"),
sig = c(TRUE, FALSE, TRUE, FALSE),
values = c(sum(gradients>0 & anova_fdr<0.05),
sum(gradients>0 & anova_fdr>0.05),
sum(gradients<0 & anova_fdr<0.05),
sum(gradients<0 & anova_fdr>0.05)))
ggplot(plot_df, aes(x = factor(sig, levels = c(TRUE, FALSE), ordered = TRUE), y = values)) +
geom_bar(aes(fill = factor(val, levels = c("pos", "neg"), ordered = TRUE)), position = "dodge", stat = "identity") +
scale_x_discrete(labels = c("Model fit improvement\nwith swapping term",
"No model fit improvement\nwith swapping term"),
name = "") +
scale_y_continuous(name = "Number of genes") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
#panel.grid.minor.y = element_blank(),
axis.text = element_text(face = "bold", size = 10),
axis.title = element_text(size = 13, face = "bold")) +
scale_fill_manual(values = c("#23809C", "#7A1305"), name = "", labels = c("Coefficient sign supports swapping\n(beta > 0)", "Coefficient sign opposes swapping\n(beta < 0)"))
Figure 19: Gene-wise swapping model fits
Gene-wise model fits are shown, separated by whether the swapping model is preferred over the null model. Y-axis values correspond to the number of genes in each category.
support_frac = sum(gradients > 0 & anova_fdr < 0.05)/sum(anova_fdr < 0.05)
counts_lanes = as.matrix(read.csv(paste0(folder_location, "/data/wilson_2500.csv")))
counts_lanes = counts_lanes[rowSums(counts_lanes)>0,]
#split to the two sub_matrices
counts_1 = counts_lanes[,!grepl("_new", colnames(counts_lanes))]
counts_2 = counts_lanes[,grepl("_new", colnames(counts_lanes))]
colnames(counts_2) = gsub("_new", "", colnames(counts_2))
bc1s = sapply(strsplit(colnames(counts_1), "_"), function(x) x[1])
bc2s = sapply(strsplit(colnames(counts_1), "_"), function(x) x[2])
#for the two lane 2500
#match the name format to the other plates just for this
lane_1 = counts_1; colnames(lane_1) = paste0("some.name.", colnames(lane_1))
lane_2 = counts_2; colnames(lane_2) = paste0("some.name.", colnames(lane_2))
twolane_meta = data.frame(cell = c(colnames(lane_1)), plate = c(rep(1, ncol(lane_1))))
gene_dfs_twolane = do_all_plates(counts_4000 = lane_1, counts_2500 = lane_2, all_meta = twolane_meta, min.mean = 50, n.cores = 3)
models_twolane = get_model_from_df_list(gene_dfs_twolane)
anovas_twolane = get_anovas(gene_dfs_twolane)
anova_ps_twolane = sapply(anovas_twolane, function(x) x$`Pr(>F)`[2])
anova_frac_twolane = prop.table(table(p.adjust(anova_ps_twolane, method = "fdr")<0.05))[2]
90.5% of considered genes show a significant model fit improvement with the swapping term (FDR-corrected \(p<0.05\)). Of these genes, 97.9% have a positive value of \(\beta\), supporting the barcode swapping model.
This strongly suggests that barcode swapping is more prevalent on the HiSeq 4000 compared to the HiSeq 2500, and that it affects nearly all genes.
As a control for our model, we also applied it to data where identical library pools were sequenced on two lanes of a HiSeq 2500. Here, only 0.099% of genes have a significantly improved fit with the \(\beta\) term, as expected.
New single-cell RNAseq protocols utilise microfluidic systems to automate stages of library preparation by capturing individual cells in droplets. Each run of the microfluidic system generates a sample, which typically contains thousands of cells. These droplet-based protocols label their cells in a different manner to plate based assays, as a cell barcode (unique to each droplet) is incorporated into the transcript alongside an additional Illumina barcode that labels different sets of cells, or samples. Therefore, swapping of Illumina barcodes will move transcripts between samples, while retaining the same cell identifier (Figure 20).
knitr::include_graphics(paste0(folder_location, "/figs/10x_bcs.pdf"))
Figure 20: 10X barcode schematic
A 10X transcript contains multiple barcodes, including a 10X-supplied cell-labelling barcode, a randomly generated Unique Molecular Identifier (UMI), and an Illumina-supplied sample index. Only the sample index is expected to swap, leaving the cell barcode and UMI unchanged.
As discussed in the main text, swapping can have two effects depending on whether the cell barcodes are shared between the donor and recipient libraries. If they are shared, transcriptomes are homogenised. If they are not, artefactual cells may appear. Here, we investigate whether the barcode swapping phenomenon causes an excess of cell barcode sharing in droplet data.
The data we used were generated using the 10X Chromium system, where the manufacturers report that there are around 750,000 unique cell barcodes (Zheng et al. 2017). If we make the assumption that cell barcodes are drawn at random for each sample, this allows us to formulate a null model for the rate of sharing.
We can test for deviations in the observed rate of sharing against the null using a hypergeometric test. We test each pair of samples separately as particular samples may be more compromised by swapping than others.
#thanks to Karsten Bach for code
# Computation of p-values based on hypergeometric test
compare <- function(barcodes, samples) {
out <- NULL
samp_names = unique(samples)
samples <- as.numeric(as.factor(samples))
ids <- levels(factor(samples, levels = seq_along(unique(samples))))
combs <- combn(ids,m=2, simplify=FALSE)
for (i in seq_along(combs)) {
comb <- combs[[i]]
s1 <- comb[1]
s2 <- comb[2]
bc1 <- as.character(barcodes[samples==s1])
bc2 <- as.character(barcodes[samples==s2])
x <- length(intersect(bc1,bc2))
m <- length(bc1)
n <- 750000
k <- length(bc2)
p.val <- phyper(q=x-1,m=m,n=n,k=k,lower.tail=FALSE)
tmp <- data.frame(s1=s1,
s2=s2,
n1=m,
n2=k,
shared=x,
p=p.val)
out <- rbind(out,tmp)
}
out$s1 = samp_names[as.numeric(as.character(out$s1))]
out$s2 = samp_names[as.numeric(as.character(out$s2))]
return(out)
}
Notably, the sample indeces provided by 10X to label individual runs of the machine are actually a mixture of four different indices. Therefore it is possible for a barcode swap to occur between indices within the same sample (Figure 21); this will not be detectable after processing as all transcripts of the same sample are considered together, independent of which particular barcode from that sample was used. This makes this system inconvenient for estimating swapping rates. Instead, our analysis here focusses on the consequences of swapping in droplet data.
knitr::include_graphics(paste0(folder_location, "/figs/10x_swap.pdf"))
Figure 21: 10X barcode swapping
10X samples are labelled by sets of four barcodes. This means that it is possible for a transcript to swap its sample barcode to another within the same sample: after processing, it will not appear as though any swap has happened, because the read is still allocated to the same sample. Swaps to other samples may still occur, which may produce the artefacts described above.
drop_2500 = read.table(paste0(folder_location, "/data/tenx_2500_1.tab"), stringsAsFactors = FALSE)
split = strsplit(drop_2500[,1], "-")
barcodes = sapply(split, function(x) x[1])
samples = sapply(split, function(x) x[2])
Here, droplet data was generated from mouse embryonic cells using the 10X Genomics Chromium system. Samples were sequenced on lanes of a HiSeq 2500. 11 samples were multiplexed, varying in size between 3776 and 313 called cells. CellRanger 1.3.1 was used for sample processing with default arguments. Figure 22 shows p-values from the hypergeometric tests between every pair of samples.
# good_comp = pairwise_comparisons(barcodes = barcodes, samples = samples, n.cores = 3)
good_comp = compare(barcodes, samples)
hist(good_comp$p, xlab = "p", main = "HiSeq 2500, embryonic cells")
Figure 22: HiSeq 2500 sharing p-values
Hypergeometric test p-values between samples are shown for the HiSeq 2500, embryonic cell dataset.
if(any(p.adjust(good_comp$p, method = "fdr")!=1))
print("There are non-1 adjusted p-values")
None of the p-values for any comparison is significant after FDR correction at any significance level (all \(p=1\)). The high density of \(p=1\) values arise when two samples share no barcodes. Therefore we do not observe any excess of sharing in this data.
samp1 = read.table(paste0(folder_location, "/data/tenx_4000_1.tab"), header = T)
samp1_comp = compare(samp1$barcode, samp1$sample)
sig_row = which.min(samp1_comp$p)
This data was also generated using the 10X Genomics Chromium system, with four samples of human xenograft cells varying in size between 4608 and 1462 called cells. Libraries were sequenced on a HiSeq 4000, and data was processed using CellRanger 1.3.1 using default arguments.
Of the 9621 barcodes, 16 barcodes were observed twice, with none observed three times or more. Our pairwise comparison method identifies one significant excess of sharing after FDR correction (p = 0.0529). Here, 9 barcodes were shared compared to an expected value of 3.45 for samples of size 1771 and 1462.
Figure 23 shows p-values from the hypergeometric tests between every pair of samples.
hist(samp1_comp$p, main = "HiSeq 4000, xenograft cells", xlab = "Jaccard Similarity", breaks = 20)
Figure 23: HiSeq 4000 sharing p-values
Hypergeometric test p-values between samples are shown for the HiSeq 4000, xenograft cell dataset.
samp2_all = read.table(paste0(folder_location, "/data/tenx_4000_2.csv"), stringsAsFactors = FALSE, sep = ",", header = T)
samp2_all$barcode = sapply(strsplit(samp2_all$barcode,"-"), function(x) x[1])
samp2_all$quantile = NA
for(i in 1:nrow(samp2_all)){
samp = samp2_all$SampleID[i]
rem = samp2_all[-i,]
umis = rem$UmiSums[which(rem$SampleID == samp)]
samp2_all$quantile[i] = sum(umis<samp2_all$UmiSums[i])/length(umis)
}
samp2 = samp2_all[!samp2_all$SampleID%in%c("B1", "B2"),]
bad_comp = compare(samp2$barcode, as.numeric(factor(samp2$SampleID)))
We now apply the same set of pairwise comparisons to a dataset of mouse epithelial cells, once again processed using the 10X Chromium system, and sequenced on a HiSeq 4000. CellRanger 1.2.1 was used for data processing with default arguments. The six samples range in size between 1111 and 151 called cells. Two samples were excluded from analysis here due to failed library preparation. We observe low p-values in all pairwise comparisons (Figure 24).
hist(bad_comp$p, xlab = "p", xlim =c(0,0.01), breaks = 10, main = "HiSeq 4000, mouse epithelial cells")
Figure 24: HiSeq 4000 sharing p-values
Hypergeometric test p-values between samples are shown for the HiSeq 4000, mouse epithelial cell dataset. Note the x-axis scale, which differs from the other histograms above.
Despite this, the absolute rate of barcode sharing is still low: of 2950 barcodes, only 10 barcodes occur more than once - see the table below:
table(table(samp2$barcode))
##
## 1 2 4 5 6
## 2909 4 1 1 4
get_fraction = function(barcodes){
dup = unique(barcodes[duplicated(barcodes)])
return(sum(barcodes%in%dup)/length(barcodes))
}
fracs = c(get_fraction(samp2$barcode), get_fraction(samp1$barcode), get_fraction(barcodes))
In both datasets, we observe excess barcode sharing between samples sequenced on a HiSeq 4000. However, the actual number of shared cell barcodes remains low. We hypothesise that, due to the low rate of barcode swapping, swapped-in cell libraries (the potentially artefactual libraries) are very small compared to the libraries of real cells. Because the libraries are small, they are typically ignored by algorithms that call cell libraries, as these often depend on library sizes of individual cell barcode libraries - small libraries do not look like real cells. It is therefore plausible that droplet data has an intrinsic robustness to the generation of artefactual cells, at least when samples contain cells of comparable size, and are of high quality.
In the second HiSeq 4000 dataset (mouse epithelial cells), sample labelling combines experimental condition (A-D) with biological replicate number (1-2). Samples B1 and B2 possess considerably smaller library sizes (Figure 25), and share almost all of their cell barcodes with each other and with other samples that were multiplexed with them (Figure 26). Accordingly, samples B1 and B2 were excluded from the earlier analysis (Figure 24).
bc_tab = table(samp2_all$barcode)
ggplot(samp2_all, aes(x = factor(SampleID), y = UmiSums, fill = substr(SampleID, 1, 1))) +
geom_violin() +
scale_y_log10(labels = c("1,000", "10,000", "100,000"), breaks = 10^(3:5)) +
theme_bw() +
theme(legend.position = "none", panel.grid.minor.x = element_blank(), panel.grid.major.x = element_blank(), panel.grid.minor.y = element_blank(),
axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12, face = "bold"), axis.title = element_text(face = "bold", size = 15)) +
labs(x = "", y = "Library size") +
scale_fill_brewer(palette = "Set2")
Figure 25: Sample library sizes
Samples B1 and B2 have considerably smaller library sizes than other samples.
samp2_all$n = sapply(samp2_all$barcode, function(x) bc_tab[x])
samp2_all$n_cut = samp2_all$n>1 +1
b1_barcodes = samp2_all$barcode[samp2_all$SampleID =="B1"]
b2_barcodes = samp2_all$barcode[samp2_all$SampleID =="B2"]
samp2_all$in_b = samp2_all$barcode %in% b1_barcodes & samp2_all$barcode %in% b2_barcodes
ggplot(samp2_all, aes(x = SampleID)) +
geom_bar(aes(fill = factor(in_b)), stat = "count") +
scale_fill_manual(values = c("#E8CF76", "#383F70"), name = "", labels = c("Barcode not present\nin both B samples", "Barcode present\nin both B samples")) +
theme_bw() + theme(panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 12, face = "bold"),
axis.title = element_text(face = "bold", size = 15)) +
labs(x = "Sample", y = "Number of called cells") +
scale_y_continuous(breaks = seq(0, 1250, 250), labels = c("0", "250", "500", "750", "1,000", "1,250"))
Figure 26: Barcode sharing between samples
Samples B1 and B2 share most of their barcodes, and many barcodes observed in other samples are also observed in samples B1 and B2.
sample_share = length(intersect(samp2_all$barcode[samp2_all$SampleID=="B1"],
samp2_all$barcode[samp2_all$SampleID=="B2"]))/
length(union(samp2_all$barcode[samp2_all$SampleID=="B1"],
samp2_all$barcode[samp2_all$SampleID=="B2"]))
fail_bc = samp2_all[grepl("B", samp2_all$SampleID), "barcode"]
samp2$swapped_in = samp2$barcode%in%fail_bc
test = wilcox.test(samp2$UmiSums[samp2$swapped_in], samp2$UmiSums[!samp2$swapped_in])
Samples B1 and B2 share 94% of all their samples with each other. One possible explanation is that:
Why have these small libraries been called as cells? This has been driven by CellRanger’s cell calling algorithm: it calls cells by “selecting 10X barcodes with a total UMI count of >= 10% of the 99th percentile of the expected recovered cells”.
Even if the 99th percentile value is low (as it was here), CellRanger will still identify cells regardless of whether libraries are either noisy, of poor quality, or derived from swapping. In this experiment, it has called the most cells in samples B1 and B2 despite the problems highlighted above.
If all of the called cells are consequences of barcode swapping, it follows that:
Based on this reasoning, it should be the largest cells from the high-quality samples that swapped into the low-quality samples. Given a constant rate of swapping, the cells with larger libraries will swap more transcripts. We can show that this is the case, comparing the library sizes of these two groups of cells: “Unique cell barcodes” corresponds to the cells whose barcodes are present only in the high-quality samples, while “Shared cell barcode” corresponds to cells present in both a high quality and compromised sample. Figure 27 shows that the latter have significantly more reads (p = 9.96e-259 at a ratio of 2.2 between the medians
ggplot(samp2, aes(x=swapped_in, y = UmiSums)) +
geom_boxplot() +
theme_bw() +
scale_y_log10(breaks = c(1000, 10000, 100000), labels = c("1,000", "10,000", "100,000"), name = "Library size", lim = c(min(samp2$UmiSums)*0.8, max(samp2$UmiSums)*1.5)) +
scale_x_discrete(labels = c("Unique cell\nbarcode", "Shared cell\nbarcode"), name = "") +
geom_signif(comparisons = list(c(1, 2)), annotations = "***") +
theme(panel.grid.major.x = element_blank(), axis.text.y = element_text(size = 10), axis.text.x = element_text(size = 12, face = "bold"), axis.title = element_text(face = "bold", size = 15))
Figure 27: Library sizes of shared and unshared cell barcodes, from high-quality samples
Cells that have their barcode shared between a high quality and poor quality sample (“Shared cell barcode”) are larger than those that do not (“Unique cell barcode”). This suggests that larger cells have swapped into the poor quality samples on account of their larger size - they can contribute larger absolute amounts of RNA.
This effect has been observed in data that already faced severely compromised samples (i.e. lysed cells). However, it is possible that the multiplexing of cells of radically different RNA content could present similar problems. We advise caution when combining experimental designs of this nature with HiSeq 4000 (or other patterned flow-cell) sequencing.
bctab = table(samp2$barcode)
remove = samp2[samp2$barcode %in% names(bctab)[bctab<3],]
removal_compare = compare(remove$barcode, as.numeric(factor(remove$SampleID)))
If we remove all of the barcodes from the data that are present in more than two of the high quality samples, we still observe an excess of sharing in one of our pairwise comparisons (p = 0.0561 after FDR correction).
We recommend using a test for the degree of barcode sharing between samples as a standard part of droplet-based single-cell sequencing quality control. The method we have presented above (hypergeometric test on pairwise comparisons) is fast and easy to apply.
If barcode swapping problems are observed, the easiest approach to remove the effects is to remove any cells labelled with a barcode that exists in more than one sample in each multiplexed sequencing run. This procedure will remove the artefacts, regardless of whether the cells truly exist in both samples (which will cause transcriptome mixing) or whether artefactual “fake cells” are being created.
The cost of such a procedure will be the loss of cells for downstream analysis, even in the absence of swapping. However, in a conventionally sized experiment (<10,000 cells per sample), the expected rate of removal is low.
We have simulated the random drawing of barcodes into eight multiplexed samples of various sizes. Figure 28 shows the rate of removal according to the exclusion method described above, assuming no creation of “fake cells”. For most experiments, the rate of cell loss seems easily tolerable in exchange for artefact removal.
do_sim_remove = function(sample_size = 10000, n.samples = 8, n.barcodes = 750e3){
samples = sapply(1:n.samples, function(x) sample(x = n.barcodes, size = sample_size))
dups = samples[base::duplicated(as.numeric(samples))]
removal_rate = sum(samples%in%dups)/length(samples)
return(removal_rate)
}
sizes = c(1000, 2000, 4000, 8000, 12000, 24000)
nsim = 100
sim_results = lapply(sizes, function(x) sapply(1:nsim, function(y) do_sim_remove(sample_size = x, n.samples = 8)))
size_column = factor(rep(sizes, nsim), levels = sizes, ordered = TRUE)
plot_df = data.frame(rate = unlist(sim_results),
size = size_column[order(size_column)])
ggplot(plot_df, mapping = aes(x = size, y = rate)) +
geom_boxplot() +
labs(x= "Cells per sample", y = "Removal rate") +
theme_bw() +
scale_y_continuous(breaks = seq(0, 0.2, 0.05), labels = paste0(seq(0, 20, 5), "%"))
Figure 28: Removal rate for experiments of different sizes
If duplicate cell barcodes are removed, there is a loss of power via cell exclusion. Plotted is the expected rate of removal for experiments of different sizes: 8 multiplexed samples, each of the size given on the x-axis are considered. These estimates assume no swapping.
We have suggested a method for dealing with the consequences of barcode swapping by excluding a subset of cells from droplet-based data. However, an alternative approach may be to work at the level of reads rather than at the level of individual cells.
For example, in a 10X Genomics single-cell experiment, transcripts are generated that contain:
These transcripts are extremely complex, as there are many orthogonal sources of variation between transcripts. It is extremely unlikely to observe a transcript containing the same gene sequence, the same UMI, and the same cell barcode in two different samples, unless it has derived from swapping. By excluding such sequences from the read files before proceeding alignment and read counting, it may be possible to exclude swapped transcripts without entirely removing the affected cells. However, accounting for possible sequencing errors in every transcript comparison made is not straightforward, and without sufficient transcript complexity many reads may be unnecessarily removed.
While other genomic assays may not offer the same degree of transcript complexity as droplet methods like the 10X Chromium system, it may be possible to introduce UMIs into the protocol pre-amplification, or to use other characteristics that encode complexity (e.g. alignment start and end sites, especially for paired end sequencing data).
It would be preferable to remove only swapped transcript, without affecting the levels of the original transcript from the swapping cell. If an amplification step is used before sequencing (which is common for single-cell methods), it is reasonable that the amount of each near-duplicate transcript described above should be much higher in the donor sample rather than in any to which it has swapped. Reciprocally, the copy number of the duplicate transcript should be much lower in any recipient samples to which it has swapped. It follows that the swapped transcripts may therefore be removed without affecting the levels of true donor transcript by excluding the reads only from samples where the transcript is much less common than one other sample.
figA = ggplot(df_4000, aes(x = tot, y = mapped)) +
geom_smooth(method = "lm", se = FALSE, col = "black", alpha = 0.3, lwd = 1.5) +
geom_point(col = "cornflowerblue") +
# scale_x_log10() + scale_y_log10() +
labs(x = "Available swapping reads", y = "Observed swapped reads") +
theme_bw() +
scale_x_continuous(breaks = c(5e6, 1e7, 1.5e7), labels = c("5,000,000", "10,000,000", "15,000,000")) +
scale_y_continuous(breaks = seq(2500, 10000, 2500), labels = c("2,500", "5,000", "7,500", "10,000"))+
theme(axis.text = element_text(face = "bold", size = 15, colour = "black"),
axis.title.y = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 20, face = "bold"))
print(figA)
ggsave(figA, file = paste0(folder_location, "/figs/A.pdf"), width = 6, height = 5, useDingbats = FALSE)
plot_df = data.frame(val = c(const_adjusted[,2], const_adjusted[,1]),
share = c(rep("Cells\nsharing 1\nbarcode", nrow(const_adjusted)), rep("Cells\nsharing 0\nbarcodes", nrow(const_adjusted))))
set.seed(42*6)
figB = ggplot(data = plot_df,
mapping = aes(y = val, x = factor(share))) +
geom_boxplot( col = c("#B00029", "#EC6348") ) +
geom_jitter(width = 0.35, height = 0, col = ifelse(plot_df$share == plot_df$share[1], "grey30", "grey30")) +
theme_bw() +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
axis.text = element_text(face = "bold", size = 15, colour = "black"),
axis.title.y = element_text(size = 20, face = "bold"),
axis.title.x = element_text(size = 20, face = "bold"),
panel.grid.minor = element_blank()) +
labs(x = "", y= "Contribution to cell\ntranscriptomes") +
scale_y_continuous(breaks = seq(0, 0.05, by = 0.01), labels = paste0(seq(0, 0.05, by = 0.01)*100, "%"))
print(figB)
ggsave(figB, file = paste0(folder_location, "/figs/B.pdf"), width = 4, height = 5, useDingbats= FALSE)
For the boxplot, several points were moved laterally to ensure more easily readable jittering (hence there are slight differences to the manuscript version).
Illumina. 2017. “Effects of Index Misassignment on Multiplexing and Downstream Analysis.” [White Paper]. Accessed June 21. https://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf?linkId=36607862.
Marioni, John C., Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. 2008. “RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays.” Genome Research 18 (9): 1509–17. doi:10.1101/gr.079558.108.
Nestorowa, Sonia, Fiona K. Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola K. Wilson, David G. Kent, and Berthold Göttgens. 2016. “A Single Cell Resolution Map of Mouse Haematopoietic Stem and Progenitor Cell Differentiation.” Blood, January, blood–2016–05–716480. doi:10.1182/blood-2016-05-716480.
Picelli, Simone, Omid R. Faridani, Asa K. Bjorklund, Gosta Winberg, Sven Sagasser, and Rickard Sandberg. 2014. “Full-Length RNA-Seq from Single Cells Using Smart-Seq2.” Nature Protocols 9 (1): 171–81. doi:10.1038/nprot.2014.006.
Sinha, Rahul, Geoff Stanley, Gunsagar Singh Gulati, Camille Ezran, Kyle Joseph Travaglini, Eric Wei, Charles Kwok Fai Chan, et al. 2017. “Index Switching Causes ‘Spreading-Of-Signal’ Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing.” BioRxiv, April. doi:10.1101/125724.
Zheng, Grace X. Y., Jessica M. Terry, Phillip Belgrader, Paul Ryvkin, Zachary W. Bent, Ryan Wilson, Solongo B. Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8 (January): 14049. http://dx.doi.org/10.1038/ncomms14049.